Probability, Inference, and Genotype Likelihoods

Eric C. Anderson

Conservation Genomics Workshop, Monday August 26, 2024

Overview

Goal: Explore and understand probabilistic genotype calling in the context of some Bayesian statistics

Outline:

  • Estimating an allele frequency from genotypes
    • Bayesian ideas
    • Directed acyclic graphs
    • Monte Carlo
  • Add in uncertainty about genotypes
    • Another layer in the graph
    • Intuition for MCMC
  • The difference between calling genotypes and propagating uncertainty.

We will play around with a Shiny App to cement all three of these ideas (small-group work).

Bayesian Inference

Personal Probability

  • We all have different histories, biases, perspectives, and models that influence how we interpret data.
  • Some simple card play.
  • The take home message is that the probabilities we assign to different events depend on what information we have.

Probability as a measuring stick for uncertainty

  • It can measure a specific person’s degree of uncertainty about anything.
  • Very different from the “frequentist” idea that probabilities must be interpreted as the fraction of times an event would occur if an experiment is performed repeatedly.
  • You can be uncertain about things that will only happen once. What is the probability…
    • that it will rain tomorrow?
    • that next year is the hottest in recorded history?
  • You can even be uncertain about deterministic things
    • Trivia night

It can be interesting to think of probability as a unit of measure for uncertainty.

An Interesting Aside

Probability was not developed as a mathematical framework for quantifying uncertainty

  • It actually had its start in correspondence between mathematicians on games of chance.

Nonetheless, E.T. Jaynes provides an interesting and accessible discussion of how the system of probability as has already been developed does, indeed, arise as the unique system that satisfies a remarkably small number of desirable properties for measuring uncertainty.

Updating Beliefs Probabilistically

  • We saw with the card example that new information leads to updated beliefs
  • If we want to use probability to measure uncertainty, then we should be probabilistic about updating our beliefs.
  • The reverend Thomas Bayes described this in the 1700s.
  • Bayes’ Theorem states the posterior of one hypothesis \(H_i\), out of \(k\), given data \(D\) is: \[ P(H_i|D) = \frac{P(D|H_i)P(H_i)}{\sum_{j = 1}^k P(D|H_j)P(H_j)} \]
  • Boy! I really dislike that expression!

A Simple Conditional Probability, and DAGs

Example 1: Frequency of a genotype given an allele frequency and HWE

  • Suppose the \(A\) allele is at frequency \(p_A\) in a population,
  • And the population is in Hardy-Weinberg equilibrium (HWE)
  • HWE means the frequencies of genotypes are as expected if the two allelic types in an individual are independent of one another.

\[ P(\mathrm{Individual~is}~AA|p_A) = p_A^2 \]

Which we get because the allelic types, \(Y_1\) and \(Y_2\), of the two gene copies are independent of one another, given \(p_A\):

\[ \begin{aligned} P(\mathrm{Individual~is}~AA\;|\;p_A) &= P(Y_1=A\;|\;p_A) P(Y_1=A\;|\;p_A) \\ &= p_A p_A \\ &= p_A^2 \end{aligned} \]

DAG Notation and Terminology

A Picture of Inference




  • What we just talked about was probability “running forward”
  • In a DAG the arrows run from observed nodes to unobserved nodes

  • In most of science we are interested in inference

  • I think of this as probability “running backward”

  • In a DAG the arrows run from unobserved nodes to observed nodes

  • In the DAG on the bottom we have data being genotypes from \(N\) individuals: \[ \boldsymbol{Y} = (Y_{1,1}, Y_{1,2}, Y_{2,1}, \ldots Y_{N,1}, Y_{N,2}) \]


The Probability of the Data and the Likelihood Function

\[ P(\boldsymbol{Y}~|~p_A) = \prod_{i=1}^N P(Y_{i,1}~|~p_A)P(Y_{i,2}~|~p_A) \] In fact, if we define each \(Y\) to be 0 or 1 as follows: \[ \begin{align} Y = 0 & ~~~~~\mbox{with probability} ~~~ 1 - p_A & ~~\mbox{(i.e., it's an}~a) \\ Y = 1 & ~~~~~\mbox{with probability} ~~~ p_A & ~~\mbox{(i.e., it's an}~A) \end{align} \] Then it is not too hard to see that \[ P(\boldsymbol{Y}~|~p_A) = p_A^{\sum Y_{i,j}} (1- p_A)^{2N - \sum Y_{i,j}} \] This is a probability function. But if you consider this as a function of \(p_A\) with \(\boldsymbol{Y}\) considered as fixed, then it is often referred to as the likelihood function.

Methods of Inference

For a concrete example, let \(N = 100\) diploids with 73 copies of the \(A\) allele found.

  • Method of the Eyeball
  • Method of Moments
  • Method of Maximum Likelihood

These all give you a point estimate

An alternative to those, the Bayesian way is to calculate the posterior probability of \(p_A\), given the data.

This simply uses the laws of probability to express our uncertainty about \(p_A\) given the information in the sample.

Some Important Probability Rules

Two events \(A\) and \(B\), and we use \(A\) and \(B\) to refer to the outcome of each

  • \(P(A)\) and \(P(B)\) are referred to as marginal probabilities.

  • The joint probability of \(A\) and \(B\) is the probability that both those two outcomes occurred.

  • Joint prob = marginal probability times a conditional probability: \[ P(A,B) = P(A)P(B~|~A) = P(B)P(A~|~B) \]

  • So, conditional probabilities can be computed from the joint probability: \[ P(A~|~B) = \frac{P(A,B)}{P(B)}~~~~~~~~~~~~~~~~~~~~~~~~ P(B~|~A) = \frac{P(A,B)}{P(A)} \]

  • Thus, a much easier expression for Bayes Theorem: \[ P(A~|~B) \propto P(A, B) \] with \(P(A|B)\) as a function of \(A\) with \(B\) being the fixed “data.”

  • Joint probability typically easy to compute, then just get equality by knowing \(P(A|B)\) must sum to one over values of \(A\).

Bayesian Inference for \(p_A\)

Back to our simple allele frequency example

  • We want the posterior probability, \(P(p_A\;|\;\boldsymbol{Y})\)
  • We know that is proportional to the joint probability, \(P(p_A, \boldsymbol{Y})\)
  • We have the likelihood, \(P(\boldsymbol{Y}\;|\;p_A)\)
  • So, the joint probability could be obtained from that with: \[ P(p_A, \boldsymbol{Y}) = P(\boldsymbol{Y}\;|\;p_A) P(p_A) \]

Whoa! What the heck is the marginal probability, \(P(p_A)\)?

  • This is what is called the prior distribution of \(p_A\).
  • It expresses our degree of belief about the value of \(p_A\) before we look at our data.

Bayes Theorem: The Posterior is proportional to the prior times the likelihood.

A family of priors for \(p_A\)

Since \(p_A\) is a proportion, an obvious choice for prior would be a beta distribution. The beta distribution gives a continuous prior distribution on a value that is between 0 and 1. It has two parameters, often called \(\alpha_1\) and \(\alpha_2\). Here are some examples:

The beta distribution

The beta density for a random variable \(X\) has the form: \[ p(x | \alpha_1, \alpha_2) = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\Gamma(\alpha_2)} x^{\alpha_1 - 1}(1-x)^{\alpha_2 - 1} \] The part that looks hairy is a few Gamma functions. Don’t worry about those—it is a constant. The important part (the “kernel”, as they say…) is: \[ x^{\alpha_1 - 1}(1-x)^{\alpha_2 - 1} \] Or, if we wanted this to a be a prior on \(p_A\), the prior would be proportional to: \[ {p_A}^{\alpha_1 - 1}(1-p_A)^{\alpha_2 - 1} \] And, if we wanted to be even more specific, we could choose \(\alpha_1 = \alpha_2 = 1\) to give ourselves a uniform prior which is proportional to 1: \[ P(p_A) \propto {p_A}^{1 - 1}(1-p_A)^{1 - 1} = {p_A}^{0}(1-p_A)^{0} = 1 \]

Our Inference DAG, Bayesian Version

  • To be a Bayesian, your DAG can’t have any unobserved nodes with no parents.
  • You must add parents which represent the prior.

The posterior distribution of \(p_A\)

The posterior is proportional to the likelihood times the prior

  • Likelihood: \(P(\boldsymbol{Y}~|~p_A) = p_A^{73} (1- p_A)^{127}\)
  • Prior: \(P(p_A) \propto {p_A}^{0}(1-p_A)^{0} = 1\)
  • Joint: \[ \begin{aligned} P(\boldsymbol{Y}, p_A) &\propto p_A^{73} (1- p_A)^{127} \times {p_A}^{0}(1-p_A)^{0} \\ &\propto p_A^{74-1} (1- p_A)^{128-1} \end{aligned} \] Which we recognize as a beta distibution with parameters 74 and 128.

The posterior distribution of \(p_A\)

Here is what the posterior looks like

Shiny Interlude #1

We can’t run shiny on the ConGen RStudio Server

But you can run it on your own computer with RStudio. The procedure for that is:

  1. Open up RStudio.
  2. In the R console paste this code:
if(!("usethis" %in% rownames(installed.packages()))) {
  install.packages("usethis")
}
usethis::use_course("eriqande/ngs-genotype-models")
  1. You will need to answer the “Yes” response to a few questions. This will download an RStudio project and open it.
  2. From this RStudio project’s file browser, open the file 001-allele-freq-estimation.Rmd.
  3. If the message at the top of the file says you need some new packages, click the install option.
  4. Then Click the “Run Document” button.

If that does not work for some of the students: you can go directly to the website: https://eriqande.shinyapps.io/001-allele-freq-estimation/. But, be warned that if too many people use that link it will overwhelm my free ShinyApps account.

Small Groups

Take a few minutes to introduce yourselves to one another, then play with the Shiny app and talk to one another about it as you do. Maybe even work together to do these things:

  • Input beta distribution parameters to get these different shapes:
    • an upward facing U
    • a flat line
    • a curve that keeps going up at one of the boundaries
    • a hill with a peak at 0.3
  • Observe how the posterior distribution changes when sample size changes.
  • With a true allele frequency of \(p_T = 0\), with a sample size of 50 diploids, find values of the prior parameters that will give you a posterior centered on 0.5 (in other words a ridiculously heavy prior…)

Summarizing and Sampling from the Posterior Distribution

To The Bayesian, the Posterior Distribution is Everything

  • The posterior tells you about your uncertainty for a parameter (like \(p_A\)) that captures all the information in the data.
  • Often it gets summarized into the posterior
    • Mean
    • Median
    • Mode
    • Credible interval
  • But, those are all summaries and don’t have all the information that the full posterior gives you.
  • Sometimes posterior distributions are complex and you have to learn about them by sampling from them (unlike in the allele frequency case where we derived it analytically).

Sampling from the Posterior

  • The beta posterior for allele frequency is analytically tractable, but…
  • As an example, we will see what it is like to do “Vanilla” Monte Carlo, by sampling from our beta posterior.
pA <- rbeta(n = 10^6, 74, 128)

The first 10 values of that sample look like:

pA[1:10]
 [1] 0.3948958 0.3513607 0.3483764 0.3763591 0.3004095 0.3471738 0.3584859
 [8] 0.3496734 0.3034359 0.3803256

We call this a “vanilla” Monte Carlo sample because every member of the sample was independent—this is not Markov chain Monte Carlo.

If we wanted to use the sample to approximate the full posterior, we could do that with a histogram:

hist(pA, breaks = 100)

If we want the posterior mean, that is easy, too:

mean(pA)
[1] 0.3662959

as is the posterior median:

median(pA)
[1] 0.3659344

Or the standard deviation of the posterior distribution:

sd(pA)
[1] 0.03379317

Or, the 90%-equal-tail Credible Interval

quantile(pA, probs = c(0.05, 0.95))
       5%       95% 
0.3113770 0.4226427 

All of those quantities could have been obtained analytically in this case, but it is a lot simpler to just work with a Monte Carlo sample because the operations are the same with every Monte Carlo sample (which is not necessarily true of every analytical distribution…).

Genotype Likelihoods and Inference From Read Data

Probabilistic Genotype Calling / Allele frequency estimation

We will now complicate our simple allele frequency model

  • First, we are looking at a single known SNP location, where the base in the sequence is either a \(C\) or a \(T\).
  • We no longer observe the genotype without any error or uncertainty.
  • Rather, we observe the sequencing reads, each one a copy of one or the other of the two homologous gene copies of the genotype.
  • There is also some sequencing error!

We Have Merely Added an Extra Layer of Nodes in our DAG

  • \(\boldsymbol{R}_{i}\) is a two-vector that records the number of \(C\) and \(T\) reads in individual \(i\).

The conditional probability \(P(\boldsymbol{R}_i~|~Y_{i,1}, Y_{i,2}, \mu)\)

This thing turns out to be pretty easy. It can be broken down into cases:

  1. \(i\) is homozygous for \(C\) (i.e., \(Y_{i,1} = Y_{i,2} = C\)):

  2. \(i\) is homozygous for \(T\) (i.e., \(Y_{i,1} = Y_{i,2} = T\)):

  3. \(i\) is heterozygous (i.e., \((Y_{i,1} = C, Y_{i,2} = T)\)

These probabilities, when viewed as a function of the genotype, with the read data fixed, are called the genotype likelihoods.

A picture of this

image/svg+xml ...ACGGTCACAG?TTACAGTTAGCG... ...ACGGTCACAG?TTACAGTTAGCG... Two homologous chromosomes withinan individual Position of known SNP inthe species, with allelesT and C The possible genotypes are: ...ACGGTCACAGCTTACAGTTAGCG... ...ACGGTCACAGCTTACAGTTAGCG... ...ACGGTCACAGCTTACAGTTAGCG... ...ACGGTCACAGTTTACAGTTAGCG... ...ACGGTCACAGTTTACAGTTAGCG... ...ACGGTCACAGTTTACAGTTAGCG... CC CT or TC TT ...ACGGTCACAGTTTACAGTTAGCG... ...ACGGTCACAGCTTACAGTTAGCG... The data are: 4 reads covering that site, andthe associated base quality scores CAGCTTACA ACAGCT GTTTA AGCTTACAG Read Observed Base C C T C PHRED-scaled basequality score Read# 1 2 3 4 32 (A) 37 (F) 35 (D) 33 (B)

Some Intuition of How to Sample from the Posterior (MCMC)

Get a sample for allele frequency and the latent genotypes

What is the Markov Chain part of MCMC?

Playing with genotype likelihoods and genotype posteriors

The genotype likelihoods can be calculated from the read data alone.

But, if we combine that with allele frequencies in our model:

…then we can also compute the posterior probability (by MCMC sampling) of each genotype. If you are sampling from a single population, then this provides a better estimate of the true genotype.

  • Also, it gives a better estimate of the allele frequency than simply assigning genotypes based on the likelihood alone.

  • Coming up, we have another Shiny app to play with this.

Shiny Interlude #2

From the ngs-genotype-models RStudio project that you downloaded previously:

  • Open 002-genotype-likelihoods-from-reads.Rmd
  • Install any packages that RStudio tells you that you might need.
  • Hit the “Run Document” Button

If this doesn’t work for you, then you can access the Shiny app over the web: https://eriqande.shinyapps.io/002-genotype-likelihoods-from-reads/

  • Then you can simulate genotypes, as before, but now, you can simulate reads from those genotypes, and then compute genotype posteriors using MCMC.

Back to groups…some questions

  1. What are the likelihoods for the three different possible genotypes when the read depth is 0 for an individual?
  2. What does it take for the likelihood to be highest for the heterozygote hypothesis?
  3. Is it more likely that a true heterozygote genotype will have a posterior probability that is highest for the hyothesis of “common homozygote” or the hypotheses of “rare homozygote.”
  4. How does the posterior distribution of the allele frequency computed from read data (the transparent blue histogram) compare to the posterior distribution if you know the genotypes exactly (the black line)? How does this change when read depth is increased or decreased?
  5. If you only have a single read from a heterozygous individual, will that individual’s maximum likelihood genotype ever be “heterozygote.” What about its maximum a-posteriori genotype? What are the conditions that lead to the heterozygous individual with only a single read having a high posterior probability of being a heterozygote?
  6. When read depths are low, even if you are calling genotypes using the highest posterior probability, do you expect the results to be very accurate?

There are more questions for thought at the bottom of the Shiny App Notebook, too.

Why do we care about this?

Whoa! Where’s my heterozygotes at?

R-package for assessing trends of departures from HWE where there shouldn’t be any:

https://github.com/eriqande/whoa

Survey of RAD data sets showed considerable issues:

Here are all the data sets I investigated, ordered by \(m\) from smallest to largest.

bonnethead_shark, \(m = 0.01\)

wak_chinook, \(m = 0.02\)

wifl, \(m = 0.03\)

red_drum, \(m = 0.05\)

anguilla, \(m = 0.14\)

chinook_hecht, \(m = 0.17\)

rostrata, \(m = 0.23\)

lobster, \(m = 0.25\)

anchovy, \(m = 0.28\)

snails, \(m = 0.45\)

chinook_gatk, \(m = 0.50\)

dolphin_L_albirostris, \(m = 0.65\)

dolphin_L_acutus, \(m = 0.72\)

The culprit in some cases was low read-depth and “hard-cutoff” genotype calling:

One last exercise if we have time

Propagating Uncertainty

You will often hear that using genotype likelihoods allows you to “propagate uncertainty” to downstream analyses.

This is good practice—it can help you to not underestimate your uncertainty.

Shiny Interlude #3

Our third exercise with a ShinyApp addresses this question, revealing some of the unfortunate things that can happen if you call genotypes from low coverage sequencing data and you treat them as known/certain.

From the ngs-genotype-models RStudio project that you downloaded previously:

  • Open 003-read-inference-gsi.Rmd
  • Install any packages that RStudio tells you that you might need.
  • Hit the “Run Document” Button

(If this doesn’t work for you, then you can access the Shiny app over the web: https://eriqande.shinyapps.io/003-read-inference-gsi/ )